This particular work is just me exploring the DESeq2 workflow with our data and may not be scientifically accurate. The workflow is described here as SVN Revision: 105538. Over the next few days I’ll be improving and testing this workflow.
If you’re unsure that you have all the packages required to run this workflow, open the Rmd file in your favorite text editor (I used RStudio) and change the next line from eval=FALSE to eval=TRUE. Now, when you run this workflow, the dependencies will be installed first.
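For reference, the install chunk would look something like this. This is only a sketch using the current BiocManager installer (an older Rmd may use the biocLite.R mechanism instead), and the package list here is my guess at the dependencies used below:

```r
# Hypothetical dependency list; check the Rmd chunk for the authoritative set.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("DESeq2", "sva", "PoiClaClu", "pheatmap"))
```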
htseq-count

Sample command:
htseq-count -f bam -r name -t CDS -o scaffold.htseq.sam -i ID -q scaffold_sortedByName.bam all_combined.gff
This command was run for each sample individually.
I performed a self-BLAST and looked at results with a percent identity greater than 98%, query coverage greater than 96%, and a minimum alignment length of 500 bases. Once I had this subset, I screened out the hits to exons, since we won’t be considering them in this experiment anyway. I was left with the following two gene pairs:
| Query | Match |
|---|---|
| scaffold_344578__MIS_1109813.1 | scaffold_219988__MIS_10179608.12 |
| scaffold_133898__MIS_10093600.14 | scaffold_555373__MIS_1172265.1 |
These two pairs were similar enough, based on the thresholds above, that their count data needed to be merged. The Perl script mergeCounts.pl was run on each htseq-count output individually to accomplish this. Here is a sample command used for one of the htseq-count outputs:
perl mergeCounts.pl -l realDuplicateGenes.list -tsv Day_1.htseqCount.tsv -o Day_1.htseqCount.merged.tsv
where realDuplicateGenes.list contains the two gene pairs listed above.
Once we were satisfied with the genes and their counts, we imported the count data into DESeq2.
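A minimal import sketch, assuming the merged htseq-count files and a Day/Night condition column; the file names and table layout here are illustrative, not the exact code used:

```r
library(DESeq2)

# Hypothetical sample table pointing at the merged htseq-count outputs.
samples <- c("Day_1", "Day_2", "Day_3", "Night_4", "Night_5", "Night_6")
sampleTable <- data.frame(
  sampleName = samples,
  fileName   = paste0(samples, ".htseqCount.merged.tsv"),
  condition  = rep(c("Day", "Night"), each = 3)
)

dds <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
                                  directory   = ".",
                                  design      = ~ condition)
```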
## Day_1 Day_2 Day_3 Night_4 Night_5 Night_6
## 2739735 492104 691689 1105737 1587917 969992
Next we filter out genes that were observed too rarely: we keep only those genes with a count of at least 2 in at least 2 samples.
## Day_1 Day_2 Day_3 Night_4 Night_5 Night_6
## 2705284 480844 679334 1091297 1547976 941046
## Day_1 Day_2 Day_3 Night_4 Night_5 Night_6
## 34451 11260 12355 14440 39941 28946
This reduces the dataset from 1464832 tags to 12089. For the filtered-out tags there is very little power to detect differential expression, so little information is lost by filtering.
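The filter can be expressed as follows (a sketch, assuming dds holds the imported counts):

```r
# Keep genes with a count of at least 2 in at least 2 samples.
keep <- rowSums(counts(dds) >= 2) >= 2
dds  <- dds[keep, ]
```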
rlog transformation

Many common statistical methods for exploratory analysis of multidimensional data, especially methods for clustering and ordination (e.g., principal-component analysis and the like), work best for (at least approximately) homoskedastic data; this means that the variance of an observed quantity (here, the expression strength of a gene) does not depend on the mean. In RNA-Seq data, however, variance grows with the mean. For example, if one performs PCA (principal components analysis) directly on a matrix of normalized read counts, the result typically depends only on the few most strongly expressed genes because they show the largest absolute differences between samples. A simple and often used strategy to avoid this is to take the logarithm of the normalized count values plus a small pseudocount; however, now the genes with low counts tend to dominate the results because, due to the strong Poisson noise inherent to small count values, they show the strongest relative differences between samples.
As a solution, DESeq2 offers the regularized-logarithm transformation, or rlog for short. For genes with high counts, the rlog transformation does not differ much from an ordinary log2 transformation. For genes with lower counts, however, the values are shrunken towards the genes’ averages across all samples. Using an empirical Bayesian prior on inter-sample differences in the form of a ridge penalty, this is done such that the rlog-transformed data are approximately homoskedastic.
Note: the rlog transformation is provided for applications other than differential testing. For differential testing we recommend the DESeq function applied to raw counts, as described later in this workflow, which also takes into account the dependence of the variance of counts on the mean value during the dispersion estimation step.
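Applying the transformation is a one-liner (sketch; rld is the object name assumed in later snippets):

```r
# rlog returns a DESeqTransform object; assay() extracts the transformed matrix.
rld <- rlog(dds)
head(assay(rld))
```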
A useful first step in an RNA-Seq analysis is often to assess overall similarity between samples: Which samples are similar to each other, which are different? Does this fit the expectation from the experiment’s design? We use the R function dist to calculate the Euclidean distance between samples. To avoid the distance measure being dominated by a few highly variable genes, and to have a roughly equal contribution from all genes, we use it on the rlog-transformed data:
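A sketch of the calculation, assuming rld holds the rlog-transformed data (dist operates on rows, hence the transpose):

```r
# Euclidean distances between samples in rlog space.
sampleDists <- dist(t(assay(rld)))
sampleDists
```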
## Day_1 Day_2 Day_3 Night_4 Night_5
## Day_2 81.50580
## Day_3 93.10733 89.78260
## Night_4 98.80173 95.10344 95.03750
## Night_5 63.41866 83.03560 87.19398 96.62408
## Night_6 73.46082 83.81028 79.55655 98.84027 68.63069
We visualize the distances in a heatmap:
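A plotting sketch using pheatmap, one common choice for this kind of figure (the original chunk may have used a different heatmap function):

```r
library(pheatmap)

# Cluster samples by their rlog-based Euclidean distances.
sampleDists <- dist(t(assay(rld)))
pheatmap(as.matrix(sampleDists),
         clustering_distance_rows = sampleDists,
         clustering_distance_cols = sampleDists)
```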
Another option for calculating sample distances is to use the Poisson Distance, implemented in the CRAN package PoiClaClu. Similar to the transformations offered in DESeq2, this measure of dissimilarity also takes the variance structure of counts into consideration when calculating the distances between samples. The PoissonDistance function takes the original count matrix (not normalized) with samples as rows instead of columns, so we need to transpose the counts in dds.
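A sketch of the Poisson distance calculation:

```r
library(PoiClaClu)

# PoissonDistance wants samples as rows, so transpose the raw count matrix.
poisd <- PoissonDistance(t(counts(dds)))
samplePoisDistMatrix <- as.matrix(poisd$dd)
```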
Another way to visualize sample-to-sample distances is a principal-components analysis (PCA). In this ordination method, the data points (i.e., here, the samples) are projected onto the 2D plane such that they spread out in the two directions which explain most of the differences in the data. The x-axis is the direction (or principal component) which separates the data points the most. The amount of the total variance which is contained in the direction is printed in the axis label.
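DESeq2 provides plotPCA for this directly (sketch; "condition" is the assumed grouping variable):

```r
# PCA on the rlog-transformed data, colored by condition.
plotPCA(rld, intgroup = "condition")
```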
Another plot, very similar to the PCA plot, can be made using the multidimensional scaling (MDS) function in base R. This is useful when we don’t have the original data, but only a matrix of distances. Here we have the MDS plot for the distances calculated from the rlog transformed counts:
And here from the PoissonDistance:
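Both MDS plots can be produced with base R’s cmdscale, which needs only a distance matrix (sketch; shown here for the rlog-based distances):

```r
# Classical MDS from the rlog-based Euclidean distances.
sampleDists <- dist(t(assay(rld)))
mds <- cmdscale(as.matrix(sampleDists))
plot(mds, type = "n", xlab = "MDS 1", ylab = "MDS 2")
text(mds, labels = colnames(rld))
```

For the Poisson version, substitute the PoissonDistance matrix for as.matrix(sampleDists).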
In order to normalize the raw counts, we will start by determining the relative library sizes, or size factors, for each library. For example, if the counts of the expressed genes in one sample are, on average, twice as high as in another, the size factor for the first sample should be twice as large as the one for the other sample. These size factors can be obtained with the function estimateSizeFactors:
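A sketch of the call:

```r
# Estimate and inspect the per-library size factors.
dds <- estimateSizeFactors(dds)
sizeFactors(dds)
```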
## Day_1 Day_2 Day_3 Night_4 Night_5 Night_6
## 1.6219205 0.6622708 0.5909770 0.3381402 2.2756803 2.5534369
Once we have this information, the normalized data is obtained by dividing each column of the count table by the corresponding size factor. We can perform this calculation by calling the function counts with the normalized argument set to TRUE; with normalized set to FALSE, the raw counts are returned instead. Both are shown below for comparison.
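A sketch of the calls producing the output below (normalized first, then raw):

```r
# Normalized counts: each column divided by its size factor.
head(counts(dds, normalized = TRUE))
summary(counts(dds, normalized = TRUE))

# Raw counts, for comparison.
head(counts(dds, normalized = FALSE))
summary(counts(dds, normalized = FALSE))
```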
## Day_1 Day_2 Day_3 Night_4 Night_5
## scaffold_0__MIS_10000001.165 3.699318 1.509956 8.460567 8.87206 8.3491517
## scaffold_0__MIS_10000001.177 1.233106 0.000000 0.000000 0.00000 3.5154323
## scaffold_0__MIS_10000001.23 0.000000 0.000000 0.000000 0.00000 0.8788581
## scaffold_0__MIS_10000001.267 1.849659 0.000000 0.000000 0.00000 0.8788581
## scaffold_0__MIS_10000001.329 1.233106 0.000000 0.000000 0.00000 0.8788581
## scaffold_0__MIS_10000001.439 1.849659 0.000000 0.000000 0.00000 0.4394290
## Night_6
## scaffold_0__MIS_10000001.165 1.958145
## scaffold_0__MIS_10000001.177 0.000000
## scaffold_0__MIS_10000001.23 4.307919
## scaffold_0__MIS_10000001.267 4.307919
## scaffold_0__MIS_10000001.329 3.133032
## scaffold_0__MIS_10000001.439 1.566516
## Day_1 Day_2 Day_3
## Min. : 0.0 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.6 1st Qu.: 0.00 1st Qu.: 0.0
## Median : 1.8 Median : 0.00 Median : 0.0
## Mean : 138.0 Mean : 60.06 Mean : 95.1
## 3rd Qu.: 3.7 3rd Qu.: 4.53 3rd Qu.: 3.4
## Max. :654306.4 Max. :152886.09 Max. :401871.8
## Night_4 Night_5 Night_6
## Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.0 1st Qu.: 0.44 1st Qu.: 0.39
## Median : 0.0 Median : 1.32 Median : 1.17
## Mean : 267.0 Mean : 56.27 Mean : 30.49
## 3rd Qu.: 5.9 3rd Qu.: 3.52 3rd Qu.: 3.13
## Max. :1312257.5 Max. :229977.38 Max. :43065.88
## Day_1 Day_2 Day_3 Night_4 Night_5 Night_6
## scaffold_0__MIS_10000001.165 6 1 5 3 19 5
## scaffold_0__MIS_10000001.177 2 0 0 0 8 0
## scaffold_0__MIS_10000001.23 0 0 0 0 2 11
## scaffold_0__MIS_10000001.267 3 0 0 0 2 11
## scaffold_0__MIS_10000001.329 2 0 0 0 2 8
## scaffold_0__MIS_10000001.439 3 0 0 0 1 4
## Day_1 Day_2 Day_3
## Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.0 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 3.0 Median : 0.00 Median : 0.00
## Mean : 223.8 Mean : 39.78 Mean : 56.19
## 3rd Qu.: 6.0 3rd Qu.: 3.00 3rd Qu.: 2.00
## Max. :1061233.0 Max. :101252.00 Max. :237497.00
## Night_4 Night_5 Night_6
## Min. : 0.0 Min. : 0 Min. : 0.00
## 1st Qu.: 0.0 1st Qu.: 1 1st Qu.: 1.00
## Median : 0.0 Median : 3 Median : 3.00
## Mean : 90.3 Mean : 128 Mean : 77.84
## 3rd Qu.: 2.0 3rd Qu.: 8 3rd Qu.: 8.00
## Max. :443727.0 Max. :523355 Max. :109966.00
Plotting the rank abundance of the top 500 genes.
Differential expression was calculated using the DESeq2 wrapper function DESeq, run on 4 processors.
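A sketch of the parallel setup using BiocParallel (the registered backend is my assumption):

```r
library(BiocParallel)

# Register 4 workers, then run the full DESeq2 pipeline in parallel.
register(MulticoreParam(4))
dds <- DESeq(dds, parallel = TRUE)
res <- results(dds)
```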
We use the package sva. Here is how the package is described:
The sva package contains functions for removing batch effects and other unwanted variation in high-throughput experiments. Specifically, the sva package contains functions for identifying and building surrogate variables for high-dimensional data sets. Surrogate variables are covariates constructed directly from high-dimensional data (like gene expression/RNA sequencing/methylation/brain imaging data) that can be used in subsequent analyses to adjust for unknown, unmodeled, or latent sources of noise. The sva package can be used to remove artifacts in three ways: (1) identifying and estimating surrogate variables for unknown sources of variation in high-throughput experiments (Leek and Storey 2007 PLoS Genetics, 2008 PNAS), (2) directly removing known batch effects using ComBat (Johnson et al. 2007 Biostatistics) and (3) removing batch effects with known control probes (Leek 2014 bioRxiv). Removing batch effects and using surrogate variables in differential expression analysis have been shown to reduce dependence, stabilize error rate estimates, and improve reproducibility; see (Leek and Storey 2007 PLoS Genetics, 2008 PNAS or Leek et al. 2011 Nat. Reviews Genetics).
## Number of significant surrogate variables is: 2
## Iteration (out of 5 ):1 2 3 4 5
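A sketch following the standard svaseq recipe; the use of two surrogate variables matches the output above, but the row filter and variable names are assumptions:

```r
library(sva)

# Estimate surrogate variables from the normalized counts.
dat  <- counts(dds, normalized = TRUE)
dat  <- dat[rowMeans(dat) > 1, ]   # drop near-empty rows (illustrative cutoff)
mod  <- model.matrix(~ condition, colData(dds))
mod0 <- model.matrix(~ 1, colData(dds))
svseq <- svaseq(dat, mod, mod0, n.sv = 2)

# Add the surrogate variables to the design before re-running DESeq.
dds$SV1 <- svseq$sv[, 1]
dds$SV2 <- svseq$sv[, 2]
design(dds) <- ~ SV1 + SV2 + condition
```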
As res is a DataFrame object, it carries metadata with information on the meaning of the columns:
## DataFrame with 6 rows and 2 columns
## type description
## <character> <character>
## baseMean intermediate mean of normalized counts for all samples
## log2FoldChange results log2 fold change (MAP): condition Day vs Night
## lfcSE results standard error: condition Day vs Night
## stat results Wald statistic: condition Day vs Night
## pvalue results Wald test p-value: condition Day vs Night
## padj results BH adjusted p-values
##
## out of 12089 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 3, 0.025%
## LFC < 0 (down) : 0, 0%
## outliers [1] : 78, 0.65%
## low counts [2] : 10871, 90%
## (mean count < 10.9)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
## DataFrame with 6 rows and 2 columns
## type description
## <character> <character>
## baseMean intermediate mean of normalized counts for all samples
## log2FoldChange results log2 fold change (MAP): condition Day vs Night
## lfcSE results standard error: condition Day vs Night
## stat results Wald statistic: condition Day vs Night
## pvalue results Wald test p-value: condition Day vs Night
## padj results BH adjusted p-values
##
## out of 12089 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 61, 0.5%
## LFC < 0 (down) : 8, 0.066%
## outliers [1] : 0, 0%
## low counts [2] : 11484, 95%
## (mean count < 25.1)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
Novices in high-throughput biology often assume that thresholding these p values at a low value, say 0.05, as is often done in other settings, would be appropriate – but it is not. We briefly explain why: There are 193 genes with a p value below 0.05 among the 12089 genes, for which the test succeeded in reporting a p value.
Now, assume for a moment that the null hypothesis is true for all genes, i.e., no gene is affected by the day/night condition. Then, by the definition of the p value, we expect up to 5% of the genes to have a p value below 0.05; this amounts to about 604 genes. If we simply considered the list of 193 genes with a p value below 0.05 as differentially expressed, the expected number of false positives (604) would exceed the length of the list itself (604/193 ≈ 313%); in other words, the list could consist entirely of false positives.
DESeq2 uses the Benjamini-Hochberg (BH) adjustment as described in the base R p.adjust function; in brief, this method calculates for each gene an adjusted p value which answers the following question: if one called significant all genes with a p value less than or equal to this gene’s p value threshold, what would be the fraction of false positives (the false discovery rate, FDR) among them (in the sense of the calculation outlined above)? These values, called the BH-adjusted p values, are given in the column padj of the res object. Hence, if we consider a fraction of 10% false positives acceptable, we can consider all genes with an adjusted p value below 10% = 0.1 as significant. How many such genes are there?
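Counting them (sketch; the NA adjusted p values produced by outlier and independent filtering must be excluded):

```r
# Genes passing the 10% FDR threshold.
sum(res$padj < 0.1, na.rm = TRUE)
```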
## [1] 69
We subset the results table to these genes and then sort it by the log2 fold change estimate to get the significant genes with the strongest down-regulation.
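A sketch of the subsetting and sorting:

```r
# Significant genes, strongest down-regulation first.
resSig <- subset(res, padj < 0.1)
head(resSig[order(resSig$log2FoldChange), ])
```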
## log2 fold change (MAP): condition Day vs Night
## Wald test p-value: condition Day vs Night
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE
## <numeric> <numeric> <numeric>
## scaffold_247603__MIS_1019830.5 48.50516 -4.385639 1.1798426
## scaffold_762984__MIS_10200131.2 7636.65755 -3.871386 1.0265900
## scaffold_39942__MIS_10004005.1 276.22010 -3.429412 0.9464202
## scaffold_83360__MIS_10039751.14 183.09975 -2.921465 0.9417198
## scaffold_762984__MIS_10200131.6 180.97034 -2.810339 0.9217688
## scaffold_109736__MIS_10069576.1 70.80729 -2.556002 1.0003326
## stat pvalue padj
## <numeric> <numeric> <numeric>
## scaffold_247603__MIS_1019830.5 -3.717139 0.0002014918 0.004376742
## scaffold_762984__MIS_10200131.2 -3.771112 0.0001625219 0.004007650
## scaffold_39942__MIS_10004005.1 -3.623562 0.0002905734 0.005670869
## scaffold_83360__MIS_10039751.14 -3.102265 0.0019204584 0.026406303
## scaffold_762984__MIS_10200131.6 -3.048855 0.0022971536 0.030212564
## scaffold_109736__MIS_10069576.1 -2.555153 0.0106141265 0.093065892
…and with the strongest upregulation.
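Sketch, this time ordering by decreasing fold change:

```r
resSig <- subset(res, padj < 0.1)
head(resSig[order(resSig$log2FoldChange, decreasing = TRUE), ])
```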
## log2 fold change (MAP): condition Day vs Night
## Wald test p-value: condition Day vs Night
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE
## <numeric> <numeric> <numeric>
## scaffold_200891__MIS_10160517.7 202.78263 5.323325 1.0677017
## scaffold_12010__MIS_10012011.12 158.62925 5.252253 0.8786546
## scaffold_12010__MIS_10012011.1 45.50202 4.771448 1.0245247
## scaffold_181056__MIS_10140692.2 750.19279 4.654385 0.9769245
## scaffold_42417__MIS_10006030.1 34.16932 4.463649 1.1039401
## scaffold_7608__MIS_10007609.1 100.82771 4.264579 0.7868240
## stat pvalue padj
## <numeric> <numeric> <numeric>
## scaffold_200891__MIS_10160517.7 4.985779 6.171253e-07 1.146885e-04
## scaffold_12010__MIS_10012011.12 5.977609 2.264369e-09 1.369943e-06
## scaffold_12010__MIS_10012011.1 4.657231 3.204913e-06 3.231621e-04
## scaffold_181056__MIS_10140692.2 4.764324 1.894874e-06 2.292798e-04
## scaffold_42417__MIS_10006030.1 4.043380 5.268619e-05 2.125010e-03
## scaffold_7608__MIS_10007609.1 5.419991 5.960199e-08 1.802960e-05
A quick way to visualize the counts for a particular gene is to use the plotCounts function, which takes as arguments the DESeqDataSet, a gene name, and the group over which to plot the counts.
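For example (sketch; picking the gene with the smallest adjusted p value is illustrative):

```r
# Plot normalized counts for one gene, grouped by condition.
topGene <- rownames(res)[which.min(res$padj)]
plotCounts(dds, gene = topGene, intgroup = "condition")
```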
An “MA-plot” provides a useful overview for an experiment with a two-group comparison. The log2 fold change for a particular comparison is plotted on the y-axis and the average of the counts normalized by size factor is shown on the x-axis (“M” for minus, because a log ratio is equal to log minus log, and “A” for average).
Each gene is represented with a dot. Genes with an adjusted p-value below a threshold (here 0.1, the default) are shown in red. The DESeq2 package incorporates a prior on log2 fold changes, resulting in moderated log2 fold changes from genes with low counts and highly variable counts, as can be seen by the narrowing of spread of points on the left side of the plot. This plot demonstrates that only genes with a large average normalized count contain sufficient information to yield a significant call.
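The plot comes from DESeq2’s plotMA function (sketch; the y-axis limits are illustrative):

```r
plotMA(res, ylim = c(-6, 6))
```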
Whether a gene is called significant depends not only on its LFC but also on its within-group variability, which DESeq2 quantifies as the dispersion. For strongly expressed genes, the dispersion can be understood as a squared coefficient of variation: a dispersion value of 0.01 means that the gene’s expression typically differs by sqrt(0.01) = 10% between samples of the same treatment group. For weakly expressed genes, the Poisson noise is an additional source of noise.
The function plotDispEsts visualizes DESeq2’s dispersion estimates:
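Sketch:

```r
# Per-gene dispersion estimates, fitted trend, and final (shrunken) values.
plotDispEsts(dds)
```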